ABSTRACT

In  this  paper,  we  consider  an  optimization  problem  that  arises  in  the  execution  of 
parallel  programs  on  shared  memory  multiple-instruction  stream  multiple-data 
stream  (MIMD)  computers.  A  program  on  such  a  machine  consists  of  many 
program  segments  each  executed  sequentially  by  a  single  processor.  The  processors 
have  access  to  shared  memory,  and  can  execute  standard  memory  access  operations 
on  this  shared  memory.  This  memory  is  distributed  among  many  separate  memory 
modules.  A  network  connects  processors  to  memory  modules.  Delays  on  this 
network  are  stochastic.  Thus,  operations  issued  by  a  processor  to  distinct  memory 
modules  may  not  be  executed  as  memory  requests  on  those  modules  in  the  order 
they  were  issued. 

For performance reasons, we want to allow one operation to begin before a previous
one in the same instruction stream has completed. Our analysis gives a method for determining
which  operations  in  a  stream  may  be  issued  concurrently  without  changing  the 
semantics  of  the  execution.  We  also  consider  code  where  blocks  of  operations  have 
to  be  executed  atomically.  This  introduces  the  necessity  of  locks.  We  use  a 
conflict  graph  similar  to  that  used  to  schedule  transactions  in  distributed  databases. 
Our  graph  incorporates  the  order  on  operations  given  by  the  program  text, 
enabling  us  to  do  without  locks  even  when  database  conflict  graphs  would  suggest 
that  locks  are  necessary. 

1.   Introduction 

Programs  on  shared  memory  MIMD  computers,  e.g.  the  NYU  Ultracomputer  [GGK] 
or  the  IBM  RP3  machine  [PBG],  consist  of  many  program  segments  each  executed 
sequentially  by  a  single  processor.  The  memory  locations  accessed  by  these  programs  are 
either  locations  in  shared  memory  modules,  or  local  memory  locations,  which  we  call 
registers.   An  operation  in  a  given  processor  follows  one  of  four  basic  patterns: 

(1)  read  from  registers  (in  that  processor)  and  write  to  registers; 

(2)  read   from  registers  (in  that  processor)   and  write  to  a  single  location  in  a  shared 
memory  module; 

(3)  read  a  single  location  from  a  shared  memory  module  and  write  to  registers  (in  that 
processor); 

(4)  force  a  constant  to  either  a  register  (in  that  processor)  or  a  shared  memory  module 
location. 

We call a register or a location in a memory module by the generic name variable.
Thus, an operation consists of a read of zero or more variables followed by a write to one
or more variables. There are two restrictions: an operation accesses only one shared
memory module location and an operation accesses registers only in the processor in which
it is issued.^1

Different  program  segments  may  execute  at  vastly  different  speeds,  without  violating 
any  reasonable  criterion  of  correctness.  Thus,  the  only  semantic  correctness  requirement 
for such an execution is embodied in the following principle [La, LF].

Sequential  Consistency 

The  outcome  of  a  sequentially  consistent  computation  (or  execution)  is  as  if  all  the 
operations  were  executed  in  some  sequential  order,  with  the  constraint  that  the 
operations  of  each  individual  processor  appear  in  this  sequence  in  the  order  they  were 
executed by the processor.^2 That is, the outcome must be the same as it would be in
some  interleaving  of  the  operations  such  that  if  u  precedes  v  in  a  serial  program,  then 
u  precedes  v  in  the  interleaving. 

The  computer  design  helps  us  achieve  sequential  consistency  by  ensuring  the  atomicity 
of  the  execution  of  memory  accesses:  If  two  operations  access  the  same  memory  location 
the  two  accesses  will  have  the  same  effect  as  if  they  executed  serially. 

A  technique  for  achieving  sequential  consistency  is  given  by  Lamport  in  [La]: 
Requests  for  memory  access  are  issued  by  each  processor  in  the  order  they  occur  in  its 
program  segment;  and  requests  are  serviced  by  variables  in  the  order  they  arrive.  This 
solution  is  correct  provided  that  requests  arrive  to  memory  in  the  order  they  are  issued.  If 
requests may arrive out of order then the solution fails.

It is easy to enforce correct arrival order for memory requests: when executing each
sequential program, wait for each memory access to complete (and return a value or an
acknowledgement)  before  issuing  the  next  one.  Whatever  execution  finally  results  must  be 
consistent  with  the  order  within  each  sequential  program. 

In the context of our machine description, such a solution results in severe loss of
efficiency. The network that connects processors to shared memory is likely to have a
large latency. On the other hand, if the network is message switched, then one memory
request may be issued before a previous one has completed. For example, a processor that
accesses two different memory locations may not need to delay one access until the other
one completes. This allows the network to attain a high throughput, despite the large
latency.

^1 The results are valid for operations that access more than one shared memory location, provided that
shared memory access is atomic.

^2 The semantics of a sequence generalizes the semantics of individual operations in the natural way. Each
operation maps a state to a state. In a sequence, the (i+1)st operation produces a new state based on the state

Unfortunately, we cannot detect possible violations of sequential consistency by
analyzing each program segment in isolation. Consider, for example, the following two
program segments (the example is taken from [Co]).

Segment 1          Segment 2
p1   X := 1;       q1   y := Y;
p2   Y := 1;       q2   x := X;

X  and  Y  are  module  memory  locations  shared  by  both  program  segments  and  x  and  y  are 
different  registers.  Assume  that  initially  X  =  Y  =  0.  No  interleaving  of  these  operations 
consistent with the order in each program can lead to a state where x = 0 and y = 1. Indeed,
if x = 0 then operation q2 was executed before p1; but then q1 should take effect before p2,
so that y = 0. If the accesses to shared memory in either the first program segment or in
the  second  program  segment  are  executed  out  of  order,  then  it  is  quite  possible  to  obtain 
this  inconsistent  result  (figure  1). 

Note, however, that this is the only impossible result. For example, the access pattern
q1, p2, p1, q2 would yield x = 1 and y = 0. This would be entirely acceptable, since it
produces the same results as the interleaving q1, p1, p2, q2.^3
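
The claim about possible outcomes can be checked mechanically. The following Python sketch is not part of the original note; the tuple encoding of operations and the helper names are assumptions made for illustration. It enumerates every interleaving of the two segments that respects program order, collects the resulting (x, y) pairs, and then replays one schedule in which the two writes of Segment 1 take effect out of order.

```python
from itertools import combinations

# (name, kind, shared variable, destination register); names follow the example.
SEGMENT_1 = [("p1", "write", "X", None), ("p2", "write", "Y", None)]
SEGMENT_2 = [("q1", "read", "Y", "y"), ("q2", "read", "X", "x")]


def interleavings(seg_a, seg_b):
    """Yield every merge of the two segments that keeps each segment's order."""
    total = len(seg_a) + len(seg_b)
    for slots in combinations(range(total), len(seg_a)):
        chosen = set(slots)
        a, b = iter(seg_a), iter(seg_b)
        yield [next(a) if i in chosen else next(b) for i in range(total)]


def run(schedule):
    """Apply the operations in the given order, starting from X = Y = 0."""
    memory = {"X": 0, "Y": 0}
    registers = {"x": 0, "y": 0}
    for _name, kind, var, reg in schedule:
        if kind == "write":
            memory[var] = 1          # both writes in the example store 1
        else:
            registers[reg] = memory[var]
    return registers["x"], registers["y"]


# Outcomes (x, y) allowed by sequential consistency.
consistent = {run(s) for s in interleavings(SEGMENT_1, SEGMENT_2)}
print(sorted(consistent))

# p2 takes effect before p1, which no order-respecting interleaving allows.
p1, p2 = SEGMENT_1
q1, q2 = SEGMENT_2
print(run([p2, q1, q2, p1]))         # the forbidden pair (x, y) = (0, 1)
```

Only six interleavings exist for this example, so exhaustive enumeration is adequate here; it is meant as a check of the argument, not as a practical analysis.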

This  example  has  shown  that  it  may  be  necessary  for  a  processor  to  ensure  that  certain 
operations  are  executed  in  the  order  they  appear  in  a  program  segment  (even  when  there 
are  no  data  dependencies  within  that  segment).  We  assume  that  a  processor  may  delay  one 
memory  access  until  acknowledgment  of  some  previous  memory  request  is  received.  This 
is  the  only  mechanism  we  allow:  there  is  no  inter-segment  communication  other  than 
through  shared  variables. 

Conflicts  may  occur  only  when  shared  variables  are  modified.  Thus,  accesses  to 
private  variables  (i.e.   variables  that  occur  in  one  program  segment  only)   cannot  cause 


produced by the ith operation.

^3 The reader may observe that the sequential consistency requirement is much weaker than the serialization
principle addressed in database theory [KP], which would forbid this second outcome.

problems. The same is true for accesses to shared read-only variables. Private and read-only
variables are easy to detect. The following simple policy guarantees correct execution
of parallel code (see [EGK]):

When  an  access  to  a  shared  memory  location  is  issued,  wait  for  acknowledgement  if  a 
shared read-write variable is accessed; otherwise issue the next operation immediately.
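
As an illustration of this simple policy, the Python sketch below classifies variables and reports which operations have to wait for an acknowledgement. The encoding of a segment as a list of (name, kind, variable) triples and the function name are assumptions of this sketch, not notation from the note; a variable is treated as private when only one segment accesses it and as read-only when no operation writes it.

```python
def operations_that_wait(segments):
    """Return the names of operations that must wait for an acknowledgement.

    segments: list of program segments, each a list of (name, kind, variable)
    with kind equal to "read" or "write".
    """
    used_by = {}      # variable -> indices of the segments that access it
    written = set()   # variables written by at least one operation
    for index, segment in enumerate(segments):
        for _name, kind, var in segment:
            used_by.setdefault(var, set()).add(index)
            if kind == "write":
                written.add(var)
    shared_read_write = {v for v, users in used_by.items()
                         if len(users) > 1 and v in written}
    return {name for segment in segments
            for name, _kind, var in segment if var in shared_read_write}


segments = [
    [("s1", "read", "T"), ("s2", "write", "U"), ("s3", "write", "L")],
    [("t1", "read", "T"), ("t2", "read", "U")],
]
# T is shared but read-only, L is private, U is shared read-write.
print(sorted(operations_that_wait(segments)))
```

In this hypothetical input only the two accesses to U wait; the accesses to T and L are issued without waiting.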

We show in this paper that it is possible to improve on this simple policy by doing a
more elaborate analysis of data dependencies. We give a characterization of the minimal
set of delays that must be introduced in the code in order to enforce sequential consistency.

We  have  addressed  so  far  the  problem  of  serializing  the  execution  of  atomic  machine 
operations. This is the issue facing a machine designer. A high level language designer or
a user faces a different issue: for them the sequential consistency principle should apply to
those operations in the high level language that are defined to be atomic. (This principle
turns out to be far closer to the principle of serializability for database systems.)

When  high  level  language  constructs  are  compiled  into  machine  code,  one  high  level 
atomic  operation  translates  into  possibly  several  machine  operations.  Each  such  block  of 
code  has  to  be  executed  atomically.  Delays  are  not  sufficient  to  enforce  such  atomicity.  In 
the general case, when one "atomic" block of code accesses several memory locations, it
might be necessary to lock those locations in order to preserve atomicity.

A  straightforward  solution  is  to  lock  all  locations  accessed  by  the  operations  within  an 
atomic block before one starts executing operations within the block, and release the lock
after  all  operations  have  executed.  Locking  may  be  an  expensive  operation  and  reduces 
concurrency.    We  want  to  avoid  the  need  for  locking  to  the  greatest  possible  extent. 

We can improve on the straightforward solution by doing a more refined data
dependency  analysis.  The  analysis  detects  when  delays  are  sufficient  to  enforce  correct 
execution;  it  determines  a  minimal  set  of  locks  that  must  be  used. 

There is a significant similarity between our analysis techniques and the "conflict
graph analysis" techniques pioneered by the distributed database project SDD-1 [BSR, BG]
for concurrency control. A conflict graph is an undirected graph that indicates potential
conflicts between "transactions" (where transactions are analogous to program segments)
in different classes, where the class of a transaction consists of the variables it reads and
the variables it writes. Different considerations make the analyses diverge from this point:
the correctness criteria of the analyses differ (see previous footnote); and the intended
order of operations within a program is known in advance in our case and we use it in our
analysis, but this ordering is ignored by conflict graph analysis. There are other examples
of the interplay between database concurrency control and synchronous parallel algorithms,
e.g. [UW]. These merit further exploration.

2.   Main  Result 
2.1.   Preliminaries 

We  consider  a  concurrent  execution  of  a  parallel  program  consisting  of  several 
straight  line  program  segments.  (Later,  we  will  generalize  these  results.)  As  noted  above, 
we  assume  that  every  operation  accesses  at  most  one  location  in  the  memory  modules.  An 
operation  that  modifies  a  memory  location  is  a  write;  an  operation  that  accesses  a  memory 
location, but does not modify its contents is a read. The access to each variable is atomic
(i.e. two accesses to the same variable behave as if they occur serially in some order).
However,  since  an  operation  may  access  several  variables,  all  those  accesses  together  may 
not  necessarily  be  atomic. 

An  execution  of  the  program  is  characterized  by  two  relations:  the  order  in  which  the 
operations  are  written  in  the  program  and  the  order  in  which  they  access  variables. 
Execution  order  is  relevant  only  for  operations  that  conflict,  i.e.  operations  that  access  a 
common  variable.  The  execution  is  sequentially  consistent  if  each  operation  behaves 
atomically  and  the  conflict  and  issuing  orders  are  consistent.  The  execution  order  is  not 
known  in  advance;  however,  we  do  know  in  advance  what  (unordered)  pairs  of  operations 
conflict.    The  execution  order  should  define  an  ordering  on  each  such  pair. 

Let us formalize the previous discussion. Two relations R and S are consistent if
R ∪ S can be extended to a total ordering (a total ordering is a partial ordering that orders
every two elements). A relation can be extended to a total ordering iff its transitive
closure is irreflexive. Thus R is consistent with S iff the graph of the relation R ∪ S has no
cycles.

Let R be a symmetric relation. The relation O is an orientation of R if whenever uRv
then either uOv or vOu holds. The relation O is a proper orientation of R if O is an acyclic
orientation of R. If O is a proper orientation of R and uRv then exactly one of uOv or vOu
holds; O is obtained from R by orienting the edges of the (undirected) graph representing
R so as to obtain a directed acyclic graph.

We  characterize  the  code  by  a  relation  <V,P>.  V  is  the  set  of  operations  in  the 
program.  Intuitively,  uPv  if  u  occurs  before  v  in  one  of  the  program  segments.  Here,  P 
consists of a union of disjoint serial orders, one per program segment.

Two distinct operations are said to conflict if they access one or more of the same
variables and at least one writes one of these variables.^4 We denote the conflict relation by
C. The relation C is symmetric; it is not transitive or reflexive.
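
The definition translates directly into a computation over read and write sets. The sketch below is illustrative only: the dictionary layout (operation name mapped to a pair of sets) and the function name are assumptions, and the data encode the two-segment example of the introduction, with registers x and y treated as ordinary variables.

```python
from itertools import combinations


def conflict_relation(ops):
    """ops maps an operation name to (variables read, variables written)."""
    C = set()
    for u, v in combinations(ops, 2):
        reads_u, writes_u = ops[u]
        reads_v, writes_v = ops[v]
        # A common variable that at least one of the two operations writes.
        if writes_u & (reads_v | writes_v) or writes_v & (reads_u | writes_u):
            C.add((u, v))
            C.add((v, u))                 # C is symmetric
    return C


OPS = {
    "p1": (set(), {"X"}),        # X := 1
    "p2": (set(), {"Y"}),        # Y := 1
    "q1": ({"Y"}, {"y"}),        # y := Y
    "q2": ({"X"}, {"x"}),        # x := X
}
print(sorted(conflict_relation(OPS)))
```

For this example the only conflicting pairs are p1 with q2 (on X) and p2 with q1 (on Y).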

An execution E is an orientation of the conflict relation C. Any execution of a
program defines an execution relation E: uEv if there is a variable that is accessed by u
before it is accessed by v, such that either u or v writes on this variable.

An execution E is correct if it is acyclic and consistent with P, i.e. if E ∪ P can be
extended to a total ordering.

An execution E is correct iff E is a proper orientation of C and the graph of P ∪ E has
no cycles. Informally, E is correct if there is a total ordering of the operations that is
consistent with the order within each process (i.e. P), and with the order in which memory
accesses take effect (i.e. E). Thus, if E is correct in an execution, then the execution is
sequentially consistent.

In the example above, if q1 E p2 and p1 E q2 are the only pairs in E, then the graph
E ∪ P has no cycles. One possible ordering is q1, p1, p2, q2. However, if p2 E q1 and q2 E p1,
there is a cycle in E ∪ P as follows: p2 E q1 P q2 E p1 P p2. The cycle indicates that this second
execution order is not correct.
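
Checking whether E ∪ P can be extended to a total ordering amounts to a topological sort. The following Python sketch applies this to the two execution orders just discussed. It is an illustration under stated assumptions: relations are encoded as sets of ordered pairs, the function name is hypothetical, and the standard-library graphlib module (Python 3.9 or later) is used for the sort.

```python
from graphlib import CycleError, TopologicalSorter


def total_order(*relations):
    """Return a total ordering extending the union of the relations, or None
    if the union has a cycle. Each relation is a set of ordered pairs (u, v)."""
    sorter = TopologicalSorter()
    for relation in relations:
        for u, v in relation:
            sorter.add(v, u)              # u must precede v
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


P = {("p1", "p2"), ("q1", "q2")}
E_first = {("q1", "p2"), ("p1", "q2")}
E_second = {("p2", "q1"), ("q2", "p1")}

print(total_order(P, E_first))    # some ordering that respects P and E_first
print(total_order(P, E_second))   # None: the union contains a cycle
```

The ordering returned for the first execution need not be q1, p1, p2, q2; any ordering consistent with both relations is acceptable.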

One  can  control  the  order  of  execution  of  operations  by  introducing  delays.  Delays 
are introduced between operations executed by the same processor. We denote by uDv the
fact  that  operation  v  is  delayed  until  operation  u  is  executed  (the  processor  does  not  start 
memory  accesses  on  behalf  of  v  until  all  memory  accesses  on  behalf  of  u  have  completed). 


^4 Actually, one can use a slightly more general definition: operations o and o' conflict if there is
some state s such that the state obtained by applying o to s and then o' to the resulting state differs from the
state obtained by applying o' and then o. Under such a definition, two additions to a location would not conflict.
We use the less general definition for simplicity of presentation.

The relation D is a partial order.

Delays can be used to enforce sequentially consistent execution of serial code. We require
that D orders every conflicting pair of operations within a program segment. All the delay
predicates we propose below have this property provided that P orders every pair of
conflicting operations within the same program segment.

If u and v are conflicting instructions within a program segment then either uDv or
vDu, so that either all the memory accesses of u are executed before any memory access of
v or all memory accesses of v are executed before any memory access of u. Thus, if u and
v belong to the same program segment and uEv then uDv.

The relation E ∪ D cannot have a cycle that is wholly contained within one program
segment; such a cycle would also be a cycle of D.

If uDv, or uEv and u and v belong to the same program segment, then all the
memory accesses of u occur before any memory access of v. If uEv, and u and v belong to
distinct program segments, then the (unique) access of u to shared memory occurs before
the (unique) access of v to shared memory. Thus if u and v are two operations that access
shared memory, and there is a path from u to v in the graph of E ∪ D, then the access of u
to shared memory occurs before the access of v. It follows that the relation E ∪ D cannot
have a cycle that includes operations that access shared memory.

We have shown that E ∪ D is acyclic; E is a partial order consistent with D.
Conversely,  any  execution  order  E  that  is  consistent  with  the  delay  relation  D  may  arise 
from  the  execution  of  the  program  with  these  delays. 

We  are  looking  for  a  delay  relation   D  with  the  following  properties: 

1) Any execution order that is consistent with D is correct, i.e. ensures that E is a proper
orientation of C and that E is consistent with P.

2)  No  proper  subset  of  D  has  property  1. 

We  say  that  D  forces  consistency  if  it  has  the  above  property  1.  For  example,  the 
relation  P  forces  consistency  for  serial  program  segments.  This  amounts  to  forcing  serial 
execution  of  memory  accesses  within  each  program.  In  many  practical  cases  P  is  not 
minimal,  so  violates  property  2. 

2.2.   Border  pairs

An execution E is incorrect if P ∪ E has cycles. All such cycles are also cycles of the
graph of P ∪ C. This graph is known in advance, and its cycles indicate potential violations
of sequential consistency at run time. Suppose that we enforce a delay uDv for every pair
of operations uv such that uPv, and uv is an edge on a cycle of P ∪ C. Then each cycle of
P ∪ C is also a cycle of D ∪ C. It follows that each cycle of P ∪ E would be a cycle of
D ∪ E. Since D ∪ E is acyclic, P ∪ E is acyclic, and E is correct.

It is not necessary to enforce delays for each P edge on a cycle of P ∪ C: If uPvPw are
consecutive P edges on a cycle of P ∪ C, then a delay between u and w is sufficient to
break this potential cycle. This is formalized below.

A (cyclic) list of instructions σ = (v_0, ..., v_{n-1}, v_0) is a mixed cycle of (P, C) if σ is a
cycle in the graph of P ∪ C that contains edges both from P and from C. (Note that there
may be both a P edge and a C edge between v_i and v_{i+1}. So the cycle may be fully
contained in edges from C. But then some consecutive pair v_i and v_{i+1} in the cycle must
also be a P edge.)

The ordered pair of operations v_i v_{i+1} is an (ordered) border pair of the mixed cycle σ
if (all addition is modulo n)

(1) v_{i-1} C v_i,

(2) v_i P v_{i+1}, and

(3) v_{i+1} C v_{i+2}.

In the example above (p1 p2 q1 q2 p1) is a mixed cycle, and p1 p2 is a border pair of this
cycle.
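
For small programs the border pairs can be found by brute force. The Python sketch below is one possible way to do so and is not taken from the note: it assumes P is given as a transitively closed set of ordered pairs and C as a symmetric set of ordered pairs, and all function names are hypothetical. For each pair u P v it searches for a simple path from v back to u in P ∪ C whose first and last steps are conflict edges, which is exactly what conditions (1) to (3) require of a simple mixed cycle through the edge uv.

```python
from itertools import combinations


def serial_order(*segments):
    """Program order P: u P v whenever u precedes v in the same segment."""
    return {pair for seg in segments for pair in combinations(seg, 2)}


def symmetric(pairs):
    """Conflict relation C from a list of unordered conflicting pairs."""
    return {(u, v) for u, v in pairs} | {(v, u) for u, v in pairs}


def closes_with_conflicts(P, C, start, target):
    """True if P ∪ C has a simple path start -> ... -> target whose first
    and last steps are conflict edges (they may be the same edge)."""
    succ = {}
    for u, v in P | C:
        succ.setdefault(u, set()).add(v)

    def search(node, visited):
        for nxt in succ.get(node, ()):
            if nxt == target:
                if (node, nxt) in C:
                    return True
            elif nxt not in visited and search(nxt, visited | {nxt}):
                return True
        return False

    for nxt in succ.get(start, ()):
        if (start, nxt) not in C:
            continue                      # the step leaving start must be in C
        if nxt == target or search(nxt, {start, nxt}):
            return True
    return False


def border_pairs(P, C):
    """All (u, v) with u P v that are a border pair of some simple mixed cycle."""
    return {(u, v) for (u, v) in P if closes_with_conflicts(P, C, v, u)}


P = serial_order(["p1", "p2"], ["q1", "q2"])
C = symmetric([("p1", "q2"), ("p2", "q1")])
print(sorted(border_pairs(P, C)))
```

The search is exponential in the worst case, so this is only suitable for checking small examples by hand; it makes no claim about how the analysis would be implemented in a compiler.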

Let B be the relation "uv is a border pair of a mixed cycle". B is a transitive
relation. Indeed, let u_0 u_1 be a border pair in the mixed cycle (u_0, u_1, v_2, ..., v_{m-1}, u_0) and
let u_1 u_2 be a border pair in the mixed cycle (u_1, u_2, v'_3, ..., v'_{n-1}, u_1). Then

(u_0, u_2, v'_3, ..., v'_{n-1}, u_1, v_2, ..., v_{m-1}, u_0)

is a mixed cycle, and u_0 u_2 is a border pair of this cycle.

It is also easy to see that if v_0 v_1 is a border pair of a mixed cycle, then it is also a
border pair of a simple mixed cycle.

Delay Condition: If v_0 B v_1 then wait for v_0 to complete before issuing v_1, i.e. ensure
v_0 D v_1.

Theorem 1: B forces consistency.^5 That is, if an execution satisfies the delay condition, it
will  be  sequentially  consistent. 

Proof: Let E be an incorrect execution order. Let (v_0, v_1, ..., v_{n-1}, v_0) be a simple cycle
in the graph of P ∪ E (n > 1). Since both P and E are irreflexive and asymmetric, the cycle
contains edges from both P and from E. By transitivity of P, we may assume w.l.o.g. that
the cycle does not contain two successive P related pairs.

Since E ⊆ C, if v_i E v_{i+1} then v_i C v_{i+1}. It follows that (v_0, ..., v_{n-1}, v_0) is a mixed
cycle of (P, C). But then each ordered pair v_i v_{i+1} such that v_i P v_{i+1} is a border pair, so that
v_i B v_{i+1}. Therefore (v_0, ..., v_{n-1}, v_0) is a cycle of B ∪ E, and E is not consistent with the
delay relation. Hence, if E is consistent with the delay relation, P ∪ E contains no cycle. □

2.3.   Examples 

The  following  examples  illustrate  the  last  results.  These  examples  only  show  conflicts 
at  memory  module  locations.  We  assume  that  the  various  operations  of  the  same  program 
segment  don't  conflict  in  the  registers  of  the  processor  executing  that  segment. 

The  first  example  illustrates  that  data  dependencies  within  one  program  segment  are 
handled  correctly  as  a  particular  case  of  border  pairs. 

Program

(a) Read A
(b) Write B
(b') Read B
(a') Write A

We have a mixed cycle (aa'a) and a mixed cycle (bb'b). The pairs aa' and bb' are
border pairs, so that (a') is delayed until (a) terminates, and (b') is delayed until (b)
terminates.

Ultracomputer Note 96 Page 9

The  next  example  is  more  complex. 

Segment 1         Segment 2          Segment 3

(a) Read A        (c') Write C       (d") Read D
(b) Read B        (b') Write B       (a") Write A
(c) Read C        (d') Write D

We have three mixed cycles: (abb'd'd"a"a), (acc'd'd"a"a), and (bcc'b'b). See figure
2. The border pairs in the first program segment are ab, ac, and bc; in the second program
segment they are c'b', c'd', and b'd'; in the third program segment d"a" is a border pair.
The pair ac is not needed as it is implied by ab and bc; the same goes for c'd'. The delays
needed are ab, bc, c'b', b'd' and d"a". In this example each memory access is delayed
until the previous access has terminated.

This  example  shows  that  it  is  not  necessary  to  record  all  delays  implied  by  the  border 
pair  relation  B;  it  is  sufficient  to  record  a  set  of  delays  such  that  their  transitive  closure 
contains  B.    We  shall  return  to  this  later. 
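
A small Python sketch of this reduction, using the border pairs of the three-segment example: a pair is dropped when it is implied by two other pairs. The function name and the encoding of B as a set of ordered pairs are assumptions of the sketch, which relies on B being transitive and acyclic.

```python
def reduced_delays(B):
    """Pairs of B not implied by two other pairs of B.

    Assumes B is transitive and acyclic, as the border pair relation is;
    the transitive closure of the result is then B again.
    """
    nodes = {x for pair in B for x in pair}
    return {(u, w) for (u, w) in B
            if not any((u, v) in B and (v, w) in B for v in nodes)}


B = {("a", "b"), ("a", "c"), ("b", "c"),
     ("c'", "b'"), ("c'", "d'"), ("b'", "d'"),
     ('d"', 'a"')}
print(sorted(reduced_delays(B)))
```

For this input the pairs ac and c'd' are removed, leaving the five delays ab, bc, c'b', b'd' and d"a" listed in the example.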

The  last  example  illustrates  that  our  method  can  save  delays. 

Segment 1         Segment 2

(a) Read A        (b') Write B
(b) Read B        (c') Write C
(c) Read C        (d') Write D
(d) Read D        (a') Write A

Figure 3 shows the conflict and program order edges. We have four mixed cycles:
(abb'c'cdd'a'a), (abb'a'a), (acc'a'a) and (add'a'a). See figure 4. The border pairs in the
first program segment are ab, ac, ad, and cd; in the second program segment they are b'c',
b'a', c'a', and d'a'. In the first program segment all operations are delayed until (a) is
executed and (d) is delayed until (c) is executed. In the second program segment (a') is
delayed until all operations have been executed; (c') is delayed until (b') has been
executed. On the other hand, (b) can be executed after (c) or (d), even though these are
accesses to shared read-write variables. A symmetric condition holds for the second
program segment.

^5 Remember, we are dealing with straight line code.

2.4.   Minimal  delays 

The  last  example  also  shows  that  our  solution  is  not  minimal:  in  order  to  force 
consistency  it  is  sufficient  to  delay  all  the  operations  in  the  first  program  segment  until  (a) 
has  terminated,  and  to  delay  the  execution  of  (a')  in  the  second  program  segment  until  all 
the other operations have terminated. The delays between (c) and (d) in the first program
segment, and between (b') and (c') in the second program segment are superfluous.

It  is  quite  easy  to  see  what  went  wrong  in  the  last  example:  Although  (cdd'a'abb'c')  is 
a  mixed  cycle,  it  is  not  a  minimal  mixed  cycle;  the  mixed  cycles  (abb'a'),  (acc'a')  and 
(add'a')  are  all  contained  within  this  cycle.  This  is  repaired  by  using  minimal  cycles  in  the 
definition  of  delay  pairs. 

A simple mixed cycle (v_0, ..., v_{n-1}, v_0) of (P, C) is minimal if either no proper subset
of its nodes forms a mixed cycle, or any proper subset that does consists of consecutive
nodes (v_i, v_{i+1}, v_i) where v_i P v_{i+1} C v_i (addition is modulo n, as usual).^6 See figure 5.

Intuitively,  a  minimal  mixed  cycle  should  have  at  most  two  operations  per  program 
segment  that  are  related  by  the  program  ordering  P  (note  that  if  the  program  ordering  is 
total,  there  is  no  need  for  the  last  restriction).  If  there  were  three,  the  middle  one  could  be 
eliminated.  The  two  operations  from  the  same  program  segment  should  also  be 
consecutive in the cycle, i.e. one should be v_i and the other should be v_{i+1} for some i. If
there are two operations o and o' from the same program segment such that oPo' but o and
o' are not consecutive in the cycle, then the path from o to o' can be removed to obtain a
smaller  cycle. 

Lemma 1: Let Q and C be relations such that Q is transitive and C is symmetric. Let
σ = (v_0, ..., v_{n-1}) be a simple mixed cycle of (Q, C). Then σ is minimal iff the
following conditions hold.

(1) v_i Q v_j and i ≠ j implies j = i ± 1 (mod n).

(2) v_i C v_j and i ≠ j implies j = i ± 1 (mod n).

^6 When memory access operations are just reads and writes on single memory locations (as opposed to sets
of locations), then the definition of a minimal cycle can be simplified: A mixed cycle is minimal if no proper
subset of its nodes forms a mixed cycle. The more general definition makes no assumptions on the semantics of
memory accesses.

Proof: Let σ = (v_0, ..., v_{n-1}, v_0) be a minimal mixed cycle.

(only if) Assume v_i Q v_j, where i ≠ j and j ≠ i ± 1.^7 Then

π = (v_j, v_{j+1}, ..., v_i, v_j)

is a cycle in Q ∪ C consisting of a proper subset of nodes of σ. Since σ is minimal it
follows that π is not mixed, so that the path v_j, v_{j+1}, ..., v_i is contained in Q - C. Since Q
is transitive we obtain that v_j Q v_i. But then (v_i, v_{i+1}, ..., v_j, v_i) is a cycle, and this cycle is
mixed. Thus σ is not minimal.

Assume v_i C v_j, where j ≠ i and j ≠ i ± 1. Then both (v_i, v_{i+1}, ..., v_j, v_i) and
(v_j, v_{j+1}, ..., v_i, v_j) are cycles in Q ∪ C; at least one of these two cycles is mixed. Thus σ is
not minimal. It follows that if σ is minimal then conditions (1) and (2) hold true.

(if) If σ is a simple cycle that fulfills these two conditions, then no edge connects pairs
of nodes on σ in the graph of Q ∪ C, with the exception of the edges that are on the cycle,
or their reversals. It follows that any cycle in Q ∪ C with nodes in the set {v_0, ..., v_{n-1}}
is either identical to σ or the reversal of σ, or is of the form (v_i, v_{i+1}, v_i). Thus, σ is
minimal. □

Lemma 2: Let Q be a transitive relation, C be a symmetric relation, and E be a proper
orientation of C. Let σ = (v_0, ..., v_{n-1}, v_0) be a simple mixed cycle of (Q, E). Then the
following conditions are equivalent.

(1) σ is a minimal mixed cycle of (Q, E).

(2) σ is a minimal mixed cycle of (Q, C).

(3) σ fulfills conditions (1) and (2) of the previous lemma, i.e. if v_i Q v_j, with i ≠ j, then
j = i ± 1, and if v_i C v_j, with i ≠ j, then j = i ± 1.

Proof: We have proven in the previous lemma the equivalence of conditions (2) and (3).
Clearly, if σ is a simple mixed cycle of (Q, E), and is a minimal mixed cycle of (Q, C), then
it is also a minimal mixed cycle of (Q, E). Thus (2) => (1). We shall complete the proof
by showing that (1) => (3).

^7 It is irrelevant to our proof whether i < j or i > j.

If v_i Q v_j, where j ≠ i and j ≠ i ± 1, then the argument used in the proof of the previous
lemma shows that σ is not minimal.

If v_i C v_j, where j ≠ i and j ≠ i ± 1, then either v_i E v_j or v_j E v_i. Assume w.l.o.g. that the
former holds. (Again, we don't care whether i < j or i > j.) Then

π = (v_j, v_{j+1}, ..., v_i, v_j)

is a cycle in the graph of Q ∪ E. Since E is acyclic the path v_j, v_{j+1}, ..., v_i must contain an
edge in Q - E; the cycle π is mixed, so σ is not minimal. □

An ordered pair $v_0 v_1$ is a critical pair if it is a $P$ edge of a minimal mixed cycle of
$(P,C)$.

Theorem  2:  Let  D  be  the  "critical  pair"  relation.    Then  D  forces  consistency. 

Proof: Assume that $D$ does not force consistency. Let $E$ be an execution order relation that
is consistent with $D$, but not with $P$. Let $\sigma = (v_0, \ldots, v_{n-1}, v_0)$ be a minimal cycle in $P \cup E$.

Any minimal cycle in $P \cup E$ must be mixed since both $P$ and $E$ are irreflexive and
transitive. Thus $\sigma$ is a minimal mixed cycle of $(P,E)$ and, according to Lemma 2, $\sigma$ is a
minimal mixed cycle of $(P,C)$. Thus, if $v_i P v_{i+1}$ then $v_i D v_{i+1}$, and $\sigma$ is a cycle in the graph
of $E \cup D$; a contradiction. □

The  last  solution  turns  out  to  be  optimal. 

Theorem 3: Let $R$ be a relation that is consistent with $P$ and forces consistency. Then the
critical pair relation $D$ is contained in the transitive closure of $R$.

We  shall  use  the  following  lemma  in  our  proof: 

Lemma 3: Let $G = \langle V,E \rangle$ be a directed acyclic graph, and let $u$ and $v$ be two nonadjacent
nodes of $G$. Then either $G_1 = \langle V, E \cup \{uv\} \rangle$ or $G_2 = \langle V, E \cup \{vu\} \rangle$ is an acyclic graph.
Proof: If both $G_1$ and $G_2$ contain cycles, then there is a path in $G$ from $u$ to $v$, and a path
from $v$ to $u$. Hence $G$ contains a cycle; contradiction. □
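Lemma 3 is what allows an acyclic relation to be grown, one pair at a time, into a total order. A minimal sketch of that repeated application, assuming the graph is given as a set of ordered pairs and is acyclic on entry (the helper names are hypothetical):

```python
def reachable(edges, src, dst):
    """True if a directed path (possibly of length zero) leads from src to dst."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(v for (u, v) in edges if u == node)
    return False


def orient(edges, u, v):
    """Return edges plus uv or vu, whichever keeps the graph acyclic."""
    if not reachable(edges, v, u):   # no path v -> u, so adding uv closes no cycle
        return edges | {(u, v)}
    return edges | {(v, u)}          # by Lemma 3 there is then no path u -> v


def extend_to_total_order(nodes, edges):
    """Orient every nonadjacent pair, keeping the graph acyclic throughout."""
    edges = set(edges)
    for u in nodes:
        for v in nodes:
            if u != v and (u, v) not in edges and (v, u) not in edges:
                edges = orient(edges, u, v)
    return edges
```

Each call to `orient` re-runs a reachability search, so this is quadratic in the number of pairs; it is meant to mirror the proof, not to be efficient.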

Proof of Theorem 3: We assume w.l.o.g. that $R$ is transitive; we shall prove that $D \subseteq R$.
Let $\sigma = (v_0, \ldots, v_{n-1}, v_0)$ be a minimal mixed cycle (in $(P,C)$), and let $v_0 v_1$ be a critical
pair in this cycle. Assume that $v_0 v_1 \notin R$. We shall construct an execution order $E$ that is
consistent with $R$, such that $\sigma$ is a cycle in $E \cup P$. Hence $E$ is not correct, and $R$ does not
force consistency.

Define the relation $E'$ by $v_i E' v_{i+1}$ if $v_i C v_{i+1}$. (We will later extend $E'$ to an incorrect
conflict ordering $E$.)

Claim: The graph of $E' \cup R$ has no cycles.

The claim is trivial if $\sigma = (v_0, v_1, v_0)$: Then $E' = \{v_1 v_0\}$ and $R$ contains no path from $v_0$ to $v_1$,
otherwise $v_0 v_1 \in R$, contradicting our assumption.

Consider now the case $n > 2$. Assume that $E' \cup R$ has a cycle; let $\pi$ be a minimal such
cycle. This cycle fulfills conditions (1) and (2) of Lemma 1 (by Lemma 2). In particular,
$\pi$ does not contain two successive $R$ edges, so that all the nodes of $\pi$ are endpoints of
edges in $E'$. It follows from our construction of $E'$ that the nodes of $\pi$ are nodes of $\sigma$, and
by the minimality of $\sigma$, $\pi$ and $\sigma$ have the same set of nodes. As $\sigma$ fulfills conditions (1)
and (2) of Lemma 1 the edges of $\pi$ must coincide with edges of $\sigma$, or their reversal. In
particular, either $v_0 v_1$ or $v_1 v_0$ is an edge of $\pi$. But $v_1 v_0 \notin R$ (since $v_0 v_1 \in P$), and $v_0 v_1 \notin C$
(otherwise $(v_0, v_1, v_0)$ is a mixed cycle of $(P,C)$). It follows that $v_0 v_1 \in R$, contradicting our
assumption. This proves the claim.

As $E' \cup R$ is acyclic, we can extend $E'$ to an execution order $E$ consistent with $R$ by
repeated applications of Lemma 3. However $E \cup P$ contains a cycle, namely the cycle $\sigma$; hence $E$
is not consistent with $P$ and is therefore incorrect. □

3.   Large  Atomic  Operations  And  Locks 

In  this  section,  we  consider  the  problem  of  enforcing  sequential  consistency  at  the 
level  of  complex  atomic  constructs,  i.e.  instructions  in  the  source  program  that  translate  to 
multiple  machine  language  operations.  We  assume  that  two  mechanisms  can  be  used  for 
that  purpose: 

(i)  A  processor  may  delay  issuing  an  operation  until  some  other  operation  has  terminated 
(these  are  the  delays  used  in  the  previous  sections). 

(ii)  A  processor  may  lock  a  memory  location.  There  are  two  kinds  of  locks:  read  locks 
and  write  locks.  Several  processors  may  simultaneously  hold  read  locks  on  the  same 
location;  only  one  processor  may  hold  a  write  lock.  Moreover,  if  one  processor  holds 
a  write  lock  on  a  location,  no  other  processor  may  hold  a  read  lock  on  that  location. 
A processor may read a location only if it holds a read or a write lock on it; it may
modify a location only if it holds a write lock on it. Sometime after the processor has
finished accessing a location, it unlocks the location.
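The compatibility rules of mechanism (ii) can be summarized in a small lock table. This is only an illustrative sketch: the class and method names are hypothetical, requests that cannot be granted simply return `False` instead of blocking, and letting a processor upgrade its own read lock to a write lock is an assumption the text does not address.

```python
class LockTable:
    """Read/write lock bookkeeping for shared memory locations."""

    def __init__(self):
        self.readers = {}   # location -> set of processors holding read locks
        self.writer = {}    # location -> processor holding the write lock

    def try_read_lock(self, proc, loc):
        holder = self.writer.get(loc)
        if holder is not None and holder != proc:
            return False    # another processor holds the write lock
        self.readers.setdefault(loc, set()).add(proc)
        return True

    def try_write_lock(self, proc, loc):
        holder = self.writer.get(loc)
        others_reading = self.readers.get(loc, set()) - {proc}
        if (holder is not None and holder != proc) or others_reading:
            return False    # conflicting read or write lock exists
        self.writer[loc] = proc
        return True

    def unlock(self, proc, loc):
        self.readers.get(loc, set()).discard(proc)
        if self.writer.get(loc) == proc:
            del self.writer[loc]
```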

We  shall  restrict  our  attention  to  nonbranching  code.  The  results  can  be  extended  to 
branching  code  as  we  will  see  in  the  next  section.  The  problem  is  formalized  as  follows. 
We  have  two  relations  defined  on  the  set  V  of  operations: 

(i)     The  relation  P  specifies  the  order  within  each  program  segment:  uPv  if  u  and  v  are 
operations  of  the  same  program  segment,  and  u  precedes  v. 

(ii)    The  relation  A  specifies  atomicity  constraints:  uAv  if  u  and  v  are  operations  of  the 
same  program  segment,  and  belong  both  to  the  same  atomic  block. 

The following conditions (1)-(4) state that the order of execution of atomic blocks is
acyclic, and the order of execution of operations within each atomic block is acyclic.

(1)  The  relation  A  is  an  equivalence  relation. 

(2)  The relation $P$ is a partial order (i.e. transitive and irreflexive). Hence, so is $P \cap A$.

(3)  Let $P/A$ be the relation induced on the equivalence classes of $A$ by the relation $P$:
$[u] \, P/A \, [v]$ if there exist $u' \in [u]$ and $v' \in [v]$ such that $u' P v'$. Then $P/A$ is a partial order.

(4)  The relation $P$ is "closed" under $A$; i.e. if $[u] \, P/A \, [v]$ then $u P v$.

These conditions imply an alternate point of view: there is an order relation $P/A$
defined on $A$ equivalence classes, and an order relation $P \cap A$ defined within equivalence
classes. The relation $P$ is the lexicographic ordering defined by $P/A$ and $P \cap A$.

The  conflict  relation  C  is  defined  as  originally,  i.e.  uCv  if  both  access  the  same 
location  and  at  least  one  is  a  write. 

A  total  ordering  S  of  the  code  is  correct  if 

(i)     $S$ is consistent with $P$, i.e. $P \subseteq S$, and

(ii)    Operations  belonging  to  the  same  equivalence  class  of  A  are  executed  consecutively, 
i.e.  if  uSv,  vSw,  and  uAw ,  then  uAv  (and  vAw). 

An execution order $E$ is correct if there is an extension of $E$ to a correct total ordering.
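The two conditions on a total ordering can be checked mechanically. The sketch below assumes a hypothetical encoding: `order` is the list of operations in the order $S$, `P` is a set of ordered pairs, and `block` maps each operation to the identifier of its $A$ equivalence class.

```python
def is_correct_total_order(order, P, block):
    """Check conditions (i) and (ii) for a total ordering of the operations."""
    pos = {op: i for i, op in enumerate(order)}
    # (i) S must contain P: every P pair appears in program order.
    if any(pos[u] >= pos[v] for (u, v) in P):
        return False
    # (ii) each atomic block must occupy a contiguous range of positions.
    spans = {}
    for op, i in pos.items():
        lo, hi, count = spans.get(block[op], (i, i, 0))
        spans[block[op]] = (min(lo, i), max(hi, i), count + 1)
    return all(hi - lo + 1 == count for (lo, hi, count) in spans.values())
```

For instance, with an atomic block containing `q1` and `q2`, the ordering `p1, q1, q2` passes while `q1, p1, q2` fails condition (ii).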

Condition  (ii)  implies  that  the  effect  of  the  execution  is  as  if  each  atomic  block  were 
executed  indivisibly;  the  first  condition  implies  that  the  effect  of  the  execution  is  as  if  the 
operations were executed in the order specified by $P$.
Let $L$ be an equivalence relation and $D$ be a partial order both defined on pairs of
operations that belong to the same program segment. The relation $L$ is a locking relation
that partitions the operations into equivalence classes (locking sets) that are locked
together; $D$ is a delay relation. Execution of the program with the constraints determined
by $L$ and $D$ is defined as follows.

Delay  and  locking  conditions:  Each  processor  issues  operations  while  respecting  the 
following  two  restrictions: 

(1)  If $u D v$ then $v$ is not issued until $u$ has executed.

(2)  Let $[u]$ be the equivalence class of operation $u$ under the equivalence relation $L$.
Then, before issuing the first operation from $[u]$, a processor read-locks all memory
locations read but not written by operations from $[u]$; it write-locks all locations
written by operations from $[u]$; it releases these locks after executing all operations in
$[u]$.

An  execution  order  E  is  consistent  with  L  and  D  if  it  may  result  from  an  execution  that 
fulfills  the  delay  and  locking  conditions  defined  by  D  and  L.  The  pair  of  relations  (L,D) 
forces  correctness  if  any  execution  order  that  is  consistent  with  D  and  L  is  correct.  For 
example $(A,P)$ forces correctness. We show in this section how to improve on this trivial
solution. 

Lemma  4:  The  following  assertions  are  equivalent: 

(1)  The  execution  order  relation  E  is  correct. 

(2)  $P \cup E$ and $P/A \cup E/A$ are partial orders.

(3)  $(P \cap A, E \cap A)$ and $(P \cup A, E - A)$ have no mixed cycles.

Proof: Note first that the first conjuncts of conditions (2) and (3) are not redundant: Even
if $P/A \cup E/A$ is a partial order, $P \cup E$ might still have cycles that are wholly contained
within an equivalence class of $A$. Similarly, $(P \cap A, E \cap A)$ might have a mixed cycle that is
not a mixed cycle of $(P \cup A, E - A)$, as it is contained within a single $A$ equivalence class.

If $P \cup E$ is not a partial order then it has a cycle $\sigma$. Since $P$ and $E$ are acyclic, $\sigma$
contains edges in $P - E$ and edges in $E - P$. If the nodes of $\sigma$ are contained in an
equivalence class of $A$, then it is a mixed cycle of $(P \cap A, E \cap A)$. Otherwise this is a cycle
of $(P \cup A, E - A)$. Since $P/A$ is acyclic, the cycle contains edges from $E - A$, and is mixed.
If $P/A \cup E/A$ is not a partial order then it has a cycle. This cycle corresponds to a
mixed cycle of $(P \cup A, E - A)$.

Thus,  (3)  =>  (2)  (we  have  shown  the  contrapositive). 

If (2) holds then it is possible to extend $P/A \cup E/A$ to a total order of the equivalence
classes of $A$; it is possible to extend $P \cup E$ to a relation that totally orders each equivalence
class of $A$. The lexicographic ordering defined on the nodes of the program by these two
order relations is a correct extension of $E$. Thus (2) => (1).

Finally, it is easy to see that if $E$ can be extended to a correct total ordering, then
$(P \cap A, E \cap A)$ and $(P \cup A, E - A)$ have no mixed cycles, so that (1) => (3). □
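Condition (2) of Lemma 4 suggests a direct test of an execution order. The sketch below checks only acyclicity of $P \cup E$ and of the induced relation on atomic blocks; it assumes relations are sets of ordered pairs, that `block` maps each operation to its $A$ class, and that pairs inside one class are ignored when forming the quotient. All names are hypothetical.

```python
def has_cycle(nodes, edges):
    """Kahn-style check: repeatedly remove nodes with no incoming edges."""
    indegree = {n: 0 for n in nodes}
    for (_, v) in edges:
        indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for (u, v) in edges:
            if u == n:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return removed != len(indegree)


def is_correct_execution_order(ops, P, E, block):
    inner = set(P) | set(E)
    # Relation induced on atomic blocks; pairs inside one block are dropped.
    quotient = {(block[u], block[v]) for (u, v) in inner if block[u] != block[v]}
    return (not has_cycle(set(ops), inner)
            and not has_cycle(set(block.values()), quotient))
```

In the usage below, the first execution order is acyclic on operations but cyclic on blocks, so it violates atomicity of the block containing `q1` and `q2`.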

Lemma  4  indicates  the  way  to  enforce  correct  execution.  One  can  prevent  the 
occurrence of cycles in $P \cup E$ within equivalence classes, by delaying critical pairs in
minimal mixed cycles of $(P \cap A, C \cap A)$. Similarly, one can prevent the occurrence of
mixed cycles in $(P \cup A, E - A)$, by delaying critical pairs in minimal mixed cycles in
$(P \cup A, C - A)$. However, as $P \cup A$ is not irreflexive, we may get contradictory
requirements:  i.e.  u  should  be  delayed  until  v  executes,  and  vice  versa.  In  such  cases,  we 
must  lock  the  variables  accessed  by  u  and  v  before  either  executes.  Once  we  have  found 
the  set  of  operations  that  must  be  locked,  we  may  relax  the  delay  requirements,  as  locks 
take  care  of  part  of  the  constraints  we  intended  to  enforce  with  delays. 

Let us now give a formal description. Let $D_\cap$ be the set of critical pairs in minimal
mixed cycles of $(P \cap A, C \cap A)$. Let $D_\cup$ be the set of critical pairs in minimal mixed cycles
of $(P \cup A, C - A)$.

Let $L \subseteq A$ be an equivalence relation. Define $D_{\mathrm{int}}(L)$ to be the set of critical pairs in
minimal mixed cycles of $(P \cap L, C \cap L)$ (i.e. cycles whose nodes are contained within an
equivalence class of $L$).

Finally, let

$D(L) = D_{\mathrm{int}}(L) \cup ((D_\cap \cup D_\cup) - L)$.

We have
Lemma 5:  Any execution order $E$ that is consistent with $L$ and $D(L)$ is correct.

Proof:  Let $E$ be an execution order that results from an execution with delay and locking
conditions $D(L)$ and $L$. Assume, by contradiction, that $E$ is not correct.

Case a: $(P \cap A, E \cap A)$ has a mixed cycle. Let $\sigma = (v_0, \ldots, v_{n-1})$ be a minimal such
cycle. According to Lemma 1 $\sigma$ is a minimal mixed cycle of $(P \cap A, C \cap A)$.

a.1 - The nodes of $\sigma$ are contained in an equivalence class of $L$. Then $\sigma$ is a minimal mixed
cycle of $(P \cap L, C \cap L)$; $v_i P v_{i+1}$ implies $v_i D_{\mathrm{int}}(L) v_{i+1}$, so that $\sigma$ is a cycle in $D_{\mathrm{int}}(L) \cup E$; a
contradiction.

a.2 - The nodes of $\sigma$ are not all within the same locking set. If $v_i A v_{i+1}$ and $v_{i+1} P v_{i+2}$ then
$v_i P v_{i+2}$. It follows that $\sigma$ does not contain $A$ edges followed by $P$ edges. The cycle $\sigma$ can
be split into disjoint segments

$v_{i_1}, \ldots, v_{i_2 - 1}, \; v_{i_2}, \ldots, v_{i_3 - 1}, \; \ldots$

such that the nodes $v_{i_j}, \ldots, v_{i_{j+1} - 1}$ belong to the same equivalence class of $L$. There is
more than one such segment. If $v_{i_j - 1} E v_{i_j}$ then $v_{i_j}$ is executed after $v_{i_j - 1}$; the instructions
$v_{i_j - 1}$ and $v_{i_j}$ conflict, and require conflicting locks on the same location. The lock for $v_{i_j}$ is
secured only after the lock for $v_{i_j - 1}$ is released and, therefore, only after all the operations
in the locking set of $v_{i_j - 1}$ have executed. Thus $v_{i_j}$ executes after $v_{i_{j-1}}, \ldots, v_{i_j - 1}$. If $v_{i_j - 1} P v_{i_j}$
then $v_{i_j - 1} (D_\cap - L) v_{i_j}$ and, again, $v_{i_j}$ is executed after $v_{i_j - 1}$. Also, by a previous remark,
$v_{i_j - 1} = v_{i_{j-1}}$, i.e. the segment of consecutive operations in the same lock set as $v_{i_j - 1}$ has
length one. In either case $v_{i_{j-1}}$ is executed before $v_{i_j}$, yielding a contradiction.

Case b: $(P \cup A, E - A)$ has a mixed cycle. Let $\sigma = (v_0, \ldots, v_{n-1})$ be a minimal such
cycle. Then $\sigma$ is a minimal mixed cycle of $(P \cup A, C - A)$. If $v_i (P \cup A) v_{i+1}$ then either
$v_i L v_{i+1}$ or $v_i (D_\cup - L) v_{i+1}$. The same argument as in case a is used to derive a
contradiction. □

The claim of the last lemma may be vacuous: Since $P \cup A$ has cycles, $D_\cup - L$ may
turn out to be cyclic too. In this case the set of delays $D(L)$ is unenforceable, and no
execution order is consistent with $L$ and $D(L)$. Conversely, if $D(L) \cap L$ and $D(L)/L$ are
both acyclic then the delay and locking conditions represented by $D(L)$ and $L$ are
enforceable: $D(L)$ is acyclic, and there is no cyclic dependency of locking operations. Our
purpose is to find the "smallest" locking relation $L$ such that $D(L)$ is acyclic.

Let $D = D_\cap \cup D_\cup$. Then $D(L) = D_{\mathrm{int}}(L) \cup (D - L) \subseteq L \cup (D - L)$. If both $uv$ and
$vu$ are in the transitive closure of $D$ then the delays $u D v$ and $v D u$ are unenforceable; we
must put the pair $uv$ in $L$. It turns out that it is also sufficient to put these pairs into $L$.

Let us call pairs $uv$ such that both $uv$ and $vu$ are in the transitive closure of $D$ tight. Let $T$
be the tight relation. $T$ is an equivalence relation; the equivalence classes of $T$ are the
strongly connected components of the graph of $D$.
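Since the tight classes are the strongly connected components of $D$, they can be computed by mutual reachability. The sketch below uses a simple quadratic search, not an optimized SCC algorithm, and is run on a small hypothetical delay relation; function names are illustrative only.

```python
def tight_classes(nodes, D):
    """Group nodes that reach each other through D (strongly connected components)."""
    def reach(src):
        # Nodes reachable from src by a path of at least one D edge.
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            for (u, v) in D:
                if u == n and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    closure = {n: reach(n) for n in nodes}
    classes = []
    for u in nodes:
        cls = frozenset({u} | {v for v in closure[u] if u in closure[v]})
        if cls not in classes:
            classes.append(cls)
    return classes


def delays_outside_locks(D, classes):
    """The part D - T of the delay relation: pairs not inside one tight class."""
    cls_of = {n: c for c in classes for n in c}
    return {(u, v) for (u, v) in D if cls_of[u] != cls_of[v]}
```

With `D = {p1p2, p2p1, p3p4, q1q2}`, the only nontrivial tight class is `{p1, p2}`, and the remaining delays are `p3p4` and `q1q2`.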

Lemma 6:  Let $T$ be the tight equivalence relation. Let $D(T)$ be the delay relation
associated with the locking relation $T$. Then $D(T) \cap T$ and $D(T)/T$ are acyclic.
Proof: We have $D(T) \cap T = D_{\mathrm{int}}(T) \subseteq P$, and $P$ is acyclic. Also, $D(T) - T = D - T$, so that
$D(T)/T = D/T$. But $D/T$ is the graph induced by $D$ on the strongly connected components
of $D$; this graph is acyclic. □

The  following  theorem  sums  up  the  results  of  this  section. 

Theorem 4: Let $D_\cap$ be the set of critical pairs of minimal mixed cycles of $(P \cap A, C \cap A)$.
Let $D_\cup$ be the set of critical pairs of minimal mixed cycles of $(P \cup A, C - A)$. Let $T$ be the
equivalence relation consisting of the set of pairs $uv$ such that both $uv$ and $vu$ are in the
transitive closure of $D_\cap \cup D_\cup$. Let $D_{\mathrm{int}}(T)$ be the set of critical pairs of minimal mixed
cycles of $(P \cap T, C \cap T)$. Finally, let

$D(T) = D_{\mathrm{int}}(T) \cup ((D_\cap \cup D_\cup) - T)$.

Then the locking condition $T$ and delay condition $D(T)$ are enforceable, and force correct
execution. □

3.1.   Examples 

We  illustrate  the  last  theorem  with  several  examples.  As  before,  these  examples  only 
show  conflicts  at  memory  module  locations.  We  assume  that  the  various  operations  of  the 
same  program  segment  don't  conflict  in  the  registers  of  the  processor  executing  that 
segment. We enclose blocks of statements that are to be executed atomically in curly
brackets; a comma separates statements that can be executed concurrently (semicolon has
higher priority than comma).

First  Example 

Segment 1            Segment 2

{p1 Write a;         q1 Read a;
 p2 Write b,         q2 Read b
 p3 Read b;
 p4 Read a}

$(P \cap A, C \cap A)$ has a minimal mixed cycle $(p_1 p_2 p_3 p_4 p_1)$. See figure 6. The pairs $p_1 p_2$
and $p_3 p_4$ are critical pairs of this cycle. $(P \cup A, C - A)$ has a minimal mixed cycle
$(p_2 p_1 q_1 q_2 p_2)$. The pairs $p_2 p_1$ and $q_1 q_2$ are critical pairs of this cycle. Thus, the pair $p_1 p_2$ is
tight and a and b must be locked together when these two instructions are executed. All
other accesses are executed separately. $p_4$ is delayed until $p_3$ executes and $q_2$ is delayed
until $q_1$ executes. No delay is required between $p_2$ and $p_3$ even though these operations
are on a cycle: the lock ensures that either $p_3$ occurs before $p_1$ and $p_2$ or vice versa.

Second  Example 

Segment 1        Segment 2         Segment 3

p1 Read a;       {q1 Write a;      r1 Read c;
p2 Read b         q2 Write c}      r2 Write b

There are no cycles, neither in the graph of $(P \cap A) \cup (C \cap A)$, nor in the graph of
$(P \cup A) \cup (C - A)$. See figure 7. Thus no delays or locks are needed for the correct execution
of this program. Any interleaving will be equivalent to some interleaving where $q_1$ and $q_2$
are executed consecutively.

Third Example

Segment 1        Segment 2         Segment 3

p1 Read a;       {q1 Write a;      r1 Write b;
p2 Read b         q2 Write c}      r2 Read c

The third program differs from the second in that the order of the statements in the
last program segment has been reversed. We now have a mixed cycle in $(P \cup A, C - A)$,
namely the cycle $(p_1 p_2 r_1 r_2 q_2 q_1 p_1)$. See figure 8.

Correct execution is ensured by inserting the delays $p_1 p_2$, $r_1 r_2$, and $q_2 q_1$. We have
the paradoxical result that in order to enforce atomic execution of $q_1$ followed by $q_2$ we
have to issue $q_2$ first, and delay the issuing of $q_1$ until $q_2$ has executed. Still, no locks are
required.
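The difference between the second and third examples can be reproduced by brute force. The sketch below encodes our reading of the two programs (operations `p1, p2, q1, q2, r1, r2`, with `q1` and `q2` in one atomic block), enumerates simple cycles of $(P \cup A) \cup (C - A)$, and keeps the mixed ones, assuming that "mixed" means contained neither in $P \cup A$ nor in $C - A$. Function names are hypothetical, and the enumeration is exponential, so it is only suitable for tiny examples.

```python
def simple_cycles(nodes, edges):
    """Yield each simple directed cycle once, starting at its smallest node."""
    succ = {n: sorted(v for (u, v) in edges if u == n) for n in nodes}

    def extend(path, start):
        for nxt in succ[path[-1]]:
            if nxt == start:
                yield tuple(path)
            elif nxt > start and nxt not in path:
                yield from extend(path + [nxt], start)

    for start in sorted(nodes):
        yield from extend([start], start)


def mixed_cycles(nodes, Q, C):
    found = []
    for cyc in simple_cycles(nodes, Q | C):
        steps = list(zip(cyc, cyc[1:] + cyc[:1]))
        if any(s not in Q for s in steps) and any(s not in C for s in steps):
            found.append(cyc)
    return found


def symmetric(pairs):
    return set(pairs) | {(v, u) for (u, v) in pairs}


nodes = ["p1", "p2", "q1", "q2", "r1", "r2"]
# P union A: program order plus both directions inside the atomic block {q1, q2}.
Q = {("p1", "p2"), ("q1", "q2"), ("q2", "q1"), ("r1", "r2")}
C_second = symmetric({("p1", "q1"), ("q2", "r1"), ("p2", "r2")})
C_third = symmetric({("p1", "q1"), ("q2", "r2"), ("p2", "r1")})

second = mixed_cycles(nodes, Q, C_second)
third = mixed_cycles(nodes, Q, C_third)
# Edges of the cycle that lie in P union A are the delays to insert.
delays = {s for cyc in third
          for s in zip(cyc, cyc[1:] + cyc[:1]) if s in Q}
```

Under this encoding the second program has no mixed cycle, while the third has exactly one, whose $P \cup A$ edges are the three delays named in the text.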

3.2.   Optimizations 

We  do  not  make  any  assumptions  about  the  mechanism  used  to  secure  locks;  any 
synchronization  method  that  yields  the  effect  of  locks  can  be  used.  An  important 
optimization  may  be  obtained  whenever  an  equivalence  class  under  the  relation  L  consists 
of  just  one  operation.  In  that  case,  it  is  sufficient  to  check  that  the  location  accessed  by 
that  operation  is  not  currently  locked  by  another  processor.  (In  fact,  we  can  even  do 
better:  if  the  operation  only  reads  the  location,  then  it  is  only  necessary  to  check  that  no 
other  processor  holds  a  write  lock  on  that  location.)  This  has  the  same  effect  as  a  lock, 
access,  and  unlock,  so  our  previous  theorems  still  hold  with  this  optimization.  In  the  even 
more  special  case  when  a  location  is  always  accessed  by  an  equivalence  class  of  size  1,  no 
locks  on  that  location  are  needed  at  all. 

We  assumed  in  the  previous  section  that  all  locations  accessed  by  operations  in  the 
same  locking  class  are  locked  before  any  operation  in  the  class  executes,  and  unlocked  only 
after  all  operations  have  executed.  Instead,  we  can  use  a  weaker  two-phase  locking 
protocol  [EGLT].    The  locking  condition  is  replaced  by  the  following  condition 

(2')  Let $[u]$ be the equivalence class of operation $u$ under the equivalence relation $L$. Then
no lock is granted on the behalf of an operation in $[u]$ after locks used by operations
in $[u]$ have been released.
Thus, the protocol proceeds in two phases, a first one where locks are secured, and a
second one where locks are released. A lock to a location is secured before operations in
$[u]$ that access the location proceed, and is released only after all such operations have
executed (i.e. have been acknowledged). However, memory accesses may start before all
locks have been secured, and continue after some locks have been released.
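Condition (2') is easy to check on a trace of lock events. A minimal sketch, assuming a hypothetical trace format in which the events of one locking set are recorded as `("lock", location)` or `("unlock", location)` pairs in the order they occur:

```python
def is_two_phase(events):
    """True if no lock is acquired after the first release in this trace."""
    released = False
    for action, _location in events:
        if action == "unlock":
            released = True
        elif action == "lock" and released:
            return False
    return True
```

A trace that locks `a` and `b` and then unlocks both is two-phase; a trace that unlocks `a` before locking `b` is not.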

On the other hand, we strengthen the delay condition. The following new condition
replaces the previous one:

(1')  Assume that $u D v$. Then if $u$ and $v$ are in the same lock set then $v$ is not issued until $u$
has executed; if $u$ and $v$ belong to distinct lock sets, then the lock for $v$ is not secured
until the lock for $u$ is released.

We say that $[v]$ "occurs after" $[u]$ if $[u] = [v]$ and $u P v$, or if there is a location for
which $[u]$ and $[v]$ require conflicting locks, and the lock for $[v]$ is secured after the lock for
$[u]$ is released. The "occurs after" relation is a partial order. We say that $v$ "executes
after" $u$ if $[u] \neq [v]$ and $[v]$ "occurs after" $[u]$, or $[v] = [u]$, and $v$ is executed in memory
after $u$ is executed. The relation "executes after" is a lexicographic combination of two
partial orders, and hence is a partial order. If $u E v$ then $v$ "executes after" $u$; if also
$[u] \neq [v]$, then $v$ "executes after" all operations in $[u]$. If $u D v$ then $v$ "executes after" $u$.

The  new  definition  of  execution  order  has  all  the  relevant  properties  of  the  previous 
one. Thus, all the arguments given in the last section are valid, with the new delay and
locking conditions, and the new definition of execution order. In particular, the delay
and locking conditions $D(T)$ and $T$ defined in Theorem 4 ensure correct execution, for any
two-phase locking policy consistent with $T$.

3.3.   Minimality 

We  shall  prove  in  this  section  that  the  locking  and  delay  conditions  defined  are 
optimal. Firstly, any locking relation must contain the tight locking relation $T$,
irrespective of the delays used. Thus, $T$ is minimal. The delay relation $D(T)$ is not
necessarily minimal: It is possible to enforce correct execution with fewer delays. This,
however,  requires  more  operations  to  be  in  locking  sets;  delay  pairs  are  replaced  by 
locking pairs. This is made precise in the following theorem.
Theorem 5: Let $L$ be a locking relation and $D$ be a delay relation such that $(L,D)$ forces
correctness. Then

(1)  $L \supseteq T$;

(2)  $D \cap T \supseteq D(T) \cap T$; and

(3)  $D \cup L \supseteq D(T) \cup T$.

Proof: (outline) Assume there is a critical pair of a minimal mixed cycle of $(P \cap A, C \cap A)$
that is not contained in $D \cup L$. Then using a similar construction as in Theorem 3, we
build an execution order $E$ that is consistent with $D$ and $L$, such that $(P \cap A, E \cap A)$ has a
cycle. Thus, $D_\cap \subseteq D \cup L$. Similarly we show that $D_\cup \subseteq D \cup L$. It follows that
$D(T) \subseteq D \cup L$.

If both $uv \in D_\cap \cup D_\cup$ and $vu \in D_\cap \cup D_\cup$ then $uv \in D \cup L$ and $vu \in D \cup L$. Since $D$ is
acyclic, this implies that $uv \in L$. It follows that $T \subseteq L$, which proves (1), and
$D(T) \cup T \subseteq D \cup L$, which proves (3).

Let $uv$ be a critical pair of a minimal mixed cycle $\sigma$ in $(P \cap T, C \cap T)$. Since $T \subseteq L$, $\sigma$ is
a minimal mixed cycle of $(P \cap L, C \cap L)$. If $uv$ is not contained in $D$, then we can build an
incorrect execution order that is consistent with $D$ and $L$. It follows that $D(T) \cap T \subseteq D$,
which proves (2). □

4.   From  Straight  Line  Code  To  Programs 
4.1.   Code  Motion 

The  previous  sections  addressed  the  consistency  problem  from  the  machine  point  of 
view:  we  considered  streams  of  operations  executed  sequentially  by  each  processor.  The 
compiler  point  of  view  is  quite  different:  The  order  of  operation  execution  at  each 
processor  can  be  changed,  as  long  as  data  dependencies  are  not  violated. 

The  example  given  in  the  introduction  shows  that  it  is  not  sufficient  to  consider  only 
data  dependencies  within  each  program  segment  in  order  to  decide  on  the  validity  of  code 
motions.  In  that  example  there  are  no  data  dependencies  within  the  program  segment,  yet 
changing  the  order  of  the  operations,  can  cause  the  set  of  possible  outcomes  to  change. 

Ultracomputer  Note  96  Page  23 


The  construction  of  the  previous  section  can  be  used  to  decide  on  the  validity  of  code 
motion. If there is no delay between operations $u$ and $v$ of the same program segment then
the order of execution of these two operations in memory is arbitrary. In particular, we
can interchange the order in which they are issued. Any order of issuing that satisfies the
requirement  that  if  uDv  then  v  is  not  issued  until  u  has  been  completed  is  acceptable.  The 
set  of  possible  outcomes  will  be  identical  to  the  set  of  possible  outcomes  resulting  from  the 
execution  of  the  original  program.  However,  it  is  important  to  realize  that  we  enforce 
consistency  with  respect  to  the  original  program,  not  the  new  program.  For  example,  the 
operations  of  the  following  program  segments  can  be  executed  in  arbitrary  order: 

Segment 1        Segment 2

p1 X := 1;       q1 x := X;
p2 Y := 1;       q2 y := Y;

If initially X = Y = 0, then any of the four outcomes x = 0,1, y = 0,1 is valid. It is valid to
reverse the operations in the second program segment, thus obtaining the code given in the
first example in the introduction. Again, all four possible outcomes may arise as a result
of an execution of the new program. But only three of them are consistent with the new
program.
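The count of consistent outcomes can be confirmed by enumerating all sequentially consistent interleavings of the two segments. In this sketch each operation is a hypothetical `(destination, source)` pair, where the source is either a constant or the name of a variable to read.

```python
from itertools import combinations


def interleavings(seg1, seg2):
    """All merges of the two segments that preserve each segment's own order."""
    n = len(seg1) + len(seg2)
    for slots in combinations(range(n), len(seg1)):
        it1, it2 = iter(seg1), iter(seg2)
        yield [next(it1) if i in slots else next(it2) for i in range(n)]


def outcomes(seg1, seg2):
    """Set of final (x, y) values over all sequentially consistent executions."""
    results = set()
    for schedule in interleavings(seg1, seg2):
        mem = {"X": 0, "Y": 0, "x": None, "y": None}
        for dst, src in schedule:
            mem[dst] = mem[src] if isinstance(src, str) else src
        results.add((mem["x"], mem["y"]))
    return results


writer = [("X", 1), ("Y", 1)]          # p1; p2
reader = [("x", "X"), ("y", "Y")]      # q1; q2
original = outcomes(writer, reader)
reversed_reader = outcomes(writer, reader[::-1])
```

The original program admits all four `(x, y)` pairs, while the program with the second segment reversed admits only three: `(0, 1)` is excluded.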

4.2.  Partial  Orders 

In  most  of  our  examples,  P  is  a  union  of  chains:  operations  in  each  program  segment 
are  totally  ordered.    However,  we  never  actually  used  any  stronger  fact  than  that 

1)  P  is  a  partial  order  relating  every  conflicting  pair  of  operations  within  a  program 
segment;  and 

2)  P  only  relates  operations  in  the  same  program  segment. 

We  can  even  relax  the  first  requirement  by  adding  the  condition  on  our  delay  predicates 
that  either  a  lock  or  D  must  order  every  pair  of  conflicting  operations  within  a  program 
segment.   We  now  deal  with  the  second  requirement. 

4.3.  Looping  Programs 

In the general case our program segments will contain jump statements, due to branch
and loop constructs. We represent possible control flow in each program segment by a
flow graph (see [ASU]). An execution of a program corresponds to a (possibly self
crossing) path in the flow graph.

One  can  reduce  this  general  setting  to  the  case  of  straight-line  code  by  "decoupling" 
execution  of  distinct  blocks  within  the  same  code  segment: 

One  can  require  that  the  execution  of  a  new  block  does  not  start  until  memory  references 
issued  by  instructions  in  other  blocks  of  this  code  have  been  satisfied.  For  each  tuple  of 
blocks,  each  belonging  to  another  program  segment,  delays  are  introduced  that  enforce 
correct  concurrent  execution  of  these  blocks. 

It is also possible to do global optimization of delays:
Let $u P v$ if $u$ and $v$ are operations in the same program segment, and there is a path from $u$
to $v$ in the flow graph of this program. $P$ is transitive. However, $P$ may not be acyclic; in
particular we may have $u P u$. To the previously defined conflict relation $C$ we add pairs
$uu$, if $u$ is in a cycle of the flow graph, i.e. if $u P u$. (This extension allows us to generalize
our previous analysis directly, but it may seem strange since $u$ may be a read. Intuitively,
the conflict arises because we don't want one instantiation of a loop to catch up with a
previous one.)
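The extended relations can be derived from a flow graph by a transitive closure. A minimal sketch, assuming the flow graph of one program segment is given as a set of edges between operations (function names are hypothetical, and the closure loop is a plain fixed-point iteration, not an efficient algorithm):

```python
def reachability(flow_edges):
    """u P v iff a path of at least one edge leads from u to v in the flow graph."""
    P = set(flow_edges)
    changed = True
    while changed:
        changed = False
        for (u, v) in list(P):
            for (w, x) in list(P):
                if v == w and (u, x) not in P:
                    P.add((u, x))
                    changed = True
    return P


def extend_conflicts(C, P):
    """Add the pair uu to C for every operation u that lies on a flow-graph cycle."""
    return set(C) | {(u, v) for (u, v) in P if u == v}
```

For a segment `a; loop { b; c }` with flow edges `a -> b`, `b -> c`, `c -> b`, the operations `b` and `c` acquire self-conflicts while `a` does not.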

If  a  delay  is  inserted  between  operation  u  and  operation  v  then  operation  v  is  not 
issued  as  long  as  there  is  an  executing  instance  of  operation  u.  A  delay  relation  D  forces 
sequential  consistency  if  for  any  choice  of  an  execution  path  within  each  program  segment 
the  concurrent  execution  of  these  paths  is  sequentially  consistent. 

We define an extended mixed cycle to be a simple cycle $\sigma$ in $P \cup C$ that is neither
contained in $P$ nor in $C$. Border pairs are defined as before.

Theorem 6: Let $B$ be the relation defined by border pairs of extended mixed cycles. Then $B$ forces
sequential consistency.

Proof: Consider the set of execution paths of the program. Let $V'$ be the set of instances
of instructions occurring in these paths, and let $P'$ be the partial order defined on $V'$ by
these execution paths: $u' P' v'$ if $u'$ occurs before $v'$ on some execution path. Let $C'$ be the
relation induced by $C$ on $V'$. Any simple mixed cycle $\sigma'$ in $P' \cup C'$ corresponds to an
extended mixed cycle $\sigma$ in $P \cup C$; any border pair $u'v'$ in $\sigma'$ corresponds to a border pair
$uv$ in $\sigma$. The claim follows now from Theorem 1. □

The definition of $B$ in the last theorem does not seem to be general enough. It might
seem to be possible for there to be self-crossing cycles whose border pairs would not be
detected. Note, however, that $u C u$ for any operation $u$ in a cycle of the flow graph.
Suppose $(u_0, \ldots, u_{n-1}, u_0)$ is a mixed cycle (whose edges are neither contained in $P$ nor in
$C$), $u_0 u_1$ is a border pair of that cycle, and $u_i = u_j$, where $i < j$. Then
$(u_0, \ldots, u_i, u_{j+1}, \ldots, u_{n-1}, u_0)$ is also a mixed cycle with border pair $u_0 u_1$. We can easily
see that any mixed cycle consists of simple cycles and any border pair in that mixed cycle
will be a border pair in one of the (simple) extended mixed cycles. Therefore, it is
sufficient to consider simple mixed cycles.

The  set  of  cycles  can  be  further  reduced  by  considering  only  cycles  that  may 
correspond to minimal mixed cycles of (P', C'), for some execution paths. In particular
one  may  assume  that  cycles  do  not  contain  more  than  two  consecutive  operations  that 
belong  to  the  same  block. 

5.   Some  Practical  Considerations 
5.1.   Essential  Delays 

The  border  pair  relation  defined  in  §2  is  transitive.  The  critical  pair  relation  D 
defined  there  may  also  have  triples  of  operations  such  that  uDv,  vDw,  and  uDw .  However, 
if  we  enforce  the  first  delay  (uDv)  and  the  second  delay  (vDw) ,  the  third  delay  (uDw)  is 
respected  too.  Thus  it  is  not  necessary  to  enforce  all  the  delays  required  by  our  analysis, 
but only a subset that spans the required delay relation.

Let D be an acyclic relation defined on a set V. An ordered pair of elements uv is an
essential  pair  of  D  if  the  longest  path  connecting  u  to  v  in  the  graph  of  D  has  length  one. 
We  leave  to  the  reader  the  proof  of  the  following  lemma. 

Lemma 7: Let D' be the set of essential pairs for D. Then

(1) The transitive closure of D' contains D.

(2) Any relation D'' that is contained in D and fulfills (1) contains D'. □

Once  a  correct  delay  relation  D  has  been  computed,  it  is  easy  to  find  the  essential 
delays,  by  running  a  transitive  reduction  algorithm  [AGU]. 
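
As an illustration of the definition of essential pairs, here is a minimal sketch that computes them directly for a small acyclic relation. It is a naive reachability-based reduction, not the algorithm of [AGU]; the function name `essential_pairs` and the representation of D as a set of pairs are assumptions made for this example.

```python
from functools import lru_cache


def essential_pairs(D):
    """Return the pairs uv of the acyclic relation D such that the longest
    path from u to v in the graph of D has length one."""
    succ = {}
    for u, v in D:
        succ.setdefault(u, set()).add(v)

    @lru_cache(maxsize=None)
    def reachable(u):
        # All nodes reachable from u by a path of length at least one.
        # Terminates because D is assumed to be acyclic.
        out = set()
        for w in succ.get(u, ()):
            out.add(w)
            out |= reachable(w)
        return frozenset(out)

    essential = set()
    for u, v in D:
        # uv is redundant if another successor w of u reaches v,
        # since u -> w -> ... -> v is then a path of length at least two.
        if not any(v in reachable(w) for w in succ[u] if w != v):
            essential.add((u, v))
    return essential


D = {("u", "v"), ("v", "w"), ("u", "w")}
print(sorted(essential_pairs(D)))  # [('u', 'v'), ('v', 'w')]
```

In this example the delay uDw is dropped because it is implied by uDv and vDw, which matches the triple discussed at the start of this section.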

A similar result is valid for delay and locking pairs. A delay pair uv ∈ D(T) is essential
if either

(1) uv ∈ D(T) - T, and the longest path from u to v in D(T) ∪ T has length one; or

(2) uv ∈ D(T) ∩ T, and the longest path from u to v in D(T) ∩ T has length one.

Let D'(T) be the set of essential pairs. It is easy to see that D'(T) and T force correctness;
D'(T) is the smallest set D of delays such that D and T force correctness.

5.2.   Delay  Sets 

Our  use  of  delays  presupposes  that  a  processor  issues  memory  requests  without 
waiting  for  replies  from  the  previous  requests.  This  is  in  general  true  for  stores,  but  not 
for  loads:  all  instructions  following  a  load  are  delayed  until  the  load  is  satisfied.  Thus, 
some  delays  are  automatically  enforced  by  the  processor.  It  is  a  simple  matter  to  take  this 
fact  into  consideration  in  our  model:  Only  those  delays  in  D  that  are  not  automatically 
enforced  need  to  be  taken  care  of. 

We  assumed  that  a  processor  may  delay  an  operation  until  some  arbitrary  previous 
operation  has  completed.  The  NYU  Ultracomputer  and  IBM  RP3  have  a  weaker  delay 
mechanism:  The  issuing  of  an  operation  can  be  delayed  until  all  previously  issued 
operations  have  terminated.  This  enforces  a  delay  between  the  operation  and  all  previous 
operations.  Thus,  we  can  specify  a  set  called  the  delay  set.  These  are  operations  that  are 
not  issued  as  long  as  some  previous  operation  executes. 

Let I be a set of operations in a program. We associate with I the delay relation δ_I
defined by u δ_I v if v ∈ I and uPv. (Note that δ_I may be defined on a superset of I.) Now
suppose we define last(D) to be the set last(D) = {v : uv ∈ D}. Clearly, δ_last(D) ⊇ D.
Also, last(D) is the smallest set with this property.

Lemma 8: Let D be a delay relation that enforces sequential consistency. Then so does the
delay set last(D). Moreover, last(D) is the smallest delay set that enforces all the delay
pairs in D.

The number of delays induced by the delay set δ_D can be reduced by moving code,
consistently with D. Yet, even with code motion, the use of a delay set may result in
unnecessary delays. In the following example, b must be delayed until a completes because
of the cycle between segments 1 and 2; also, d must be delayed until c completes because
of the cycle between segments 1 and 3. But a and b should be able to proceed
concurrently with c and d. The weaker delay mechanism would forbid this possibility:
Either b is delayed until c and d complete, or d is delayed until a and b complete.

Segment 1       Segment 2        Segment 3

(a) Read A      (b') Write B     (d") Write D
(b) Read B      (a') Write A     (c") Write C
(c) Read C
(d) Read D
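
The cost of the weaker mechanism in this example can be made concrete with a short sketch. It assumes that the required delays in Segment 1 are exactly D = {ab, cd}, as stated above, and that the program order P of Segment 1 is a, b, c, d with no code motion; the helper names `last` and `delta` are chosen here for illustration.

```python
def last(D):
    """last(D) = {v : uv in D}: the operations that must wait for some other one."""
    return {v for (_, v) in D}


def delta(I, P):
    """Delay relation of the delay set I: u delta_I v iff v is in I and u P v."""
    return {(u, v) for (u, v) in P if v in I}


# Program order of Segment 1, as a transitively closed set of pairs.
order = ["a", "b", "c", "d"]
P = {(order[i], order[j]) for i in range(4) for j in range(i + 1, 4)}

# Delays required by the two mixed cycles.
D = {("a", "b"), ("c", "d")}

I = last(D)
induced = delta(I, P)
print(sorted(I))            # ['b', 'd']
print(sorted(induced - D))  # [('a', 'd'), ('b', 'd')]
```

The delay set {b, d} enforces both required delays, but it also makes d wait for a and b, which is the unnecessary delay described in the text for this ordering of the code.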

6.   Conclusion 

This paper presents a method to enforce efficient and sequentially consistent execution
of concurrent processes on a shared-memory multiprocessor when memory access is
asynchronous. Our method determines when consecutive operations in the same program
segment of a parallel program may execute concurrently, without violating the
programmer's view that each segment executes in its given program order. An actual
implementation of this method depends on two factors that were not discussed.

First, one should be able to detect data dependencies. This is a problem faced by any
optimizing compiler, and sophisticated methods have been developed for that purpose
[ASU]. In particular, index analysis to discover data dependencies across loop iterations
[AK] is relevant to our purpose. In general, the compiler will not be able to detect
existing data dependencies accurately, but will have to assume further dependencies.
These extra data dependencies reduce the efficiency of the code produced, but do not affect
its correctness.

Second,  one  has  to  find  all  the  cycles  in  a  graph.  This  requires  time  exponential  in  the 
number  of  nodes  in  a  general  graph.  However,  the  graphs  arising  from  program  segments 
have  a  restricted  structure  and  it  seems  that  the  cycles  can  be  detected  in  polynomial  time. 
A  polynomial  time  algorithm  for  detection  of  border  pairs  of  mixed  cycles  can  be  given 
when  each  program  is  straight-line;  this  algorithm  extends  to  the  code  obtained  from  high 
level  language  program  segments  with  bounded  nesting  of  conditionals. 

Our analysis presupposes that the processor has the ability to delay the issuing of an
instruction until some previous instructions have been executed. Pipelined processors often
have such locking mechanisms. Moreover, the processor detects data dependencies by
itself, and uses the locks to enforce correct sequencing of data dependent operations.

The  results  of  this  paper  suggest  that  a  more  general  mechanism  is  required  for 
processors  used  in  a  multiprocessor  environment.  Rather  than  checking  for  data 
dependencies  by  itself,  the  processor  should  use  data  dependency  information  provided  by 
the  compiler.  For  each  operation  it  should  be  possible  to  specify  one,  or  a  few  preceding 
operations  that  have  to  complete  before  the  new  operation  is  executed.  The  locking 
mechanism  will  be  based  on  this  information.  In  a  uniprocessor  environment,  delays 
would  exactly  correspond  to  data  dependencies;  the  extra  flexibility  could  be  used  to  allow 
pipelining  of  operations  involving  memory  accesses  (pipelined  processors  usually  assume 
that  any  two  accesses  to  memory  may  conflict).  In  a  multiprocessor  environment 
"gratuitous"  data  dependencies  would  be  added  to  reflect  interprocess  interferences. 

a = b = 0

Segment 1       Segment 2

a := 1          my_b := b
b := 1          my_a := a

Want execution to be equivalent to some
interleaving of Segment 1 and Segment 2.

If operations of either segment are
executed out of order, we may get
my_a = 0 and my_b = 1, which is not
equivalent to any interleaving.

Figure 2
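
The claim of Figure 2 can be checked by exhaustive enumeration. The following is a minimal sketch, assuming an idealized memory in which each operation takes effect atomically at the point where it appears in the schedule; the encoding of operations as tuples and the helper names `interleavings` and `run` are hypothetical.

```python
from itertools import combinations

# Operations of Figure 2, in program order.
SEG1 = [("write", "a", 1), ("write", "b", 1)]
SEG2 = [("read", "b", "my_b"), ("read", "a", "my_a")]


def interleavings(xs, ys):
    """Yield every merge of xs and ys that keeps each list in its own order."""
    n, m = len(xs), len(ys)
    for pos in combinations(range(n + m), n):
        slots = set(pos)  # positions taken by elements of xs
        xi, yi = iter(xs), iter(ys)
        yield [next(xi) if k in slots else next(yi) for k in range(n + m)]


def run(schedule):
    """Execute a schedule on memory a = b = 0 and return (my_a, my_b)."""
    mem = {"a": 0, "b": 0}
    local = {}
    for op, loc, arg in schedule:
        if op == "write":
            mem[loc] = arg
        else:
            local[arg] = mem[loc]
    return (local["my_a"], local["my_b"])


# Outcomes allowed by sequential consistency.
consistent = {run(s) for s in interleavings(SEG1, SEG2)}
# Outcomes when the two writes of Segment 1 take effect out of order.
reordered = {run(s) for s in interleavings(SEG1[::-1], SEG2)}

print((0, 1) in consistent)  # False
print((0, 1) in reordered)   # True
```

No interleaving that respects program order yields my_a = 0 and my_b = 1, whereas reversing the two writes of Segment 1 makes that outcome reachable.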

Cycle (a,b,c,d,e,f,g,a) is not
minimal because a and c are
not consecutive. (a,c,b,a) is a
shorter cycle as is (a,c,d,e,f,g,a).

Boxes indicate atomic blocks.
Mixed  cycle  in  (PnA,  CDA) 
(W1(b),R1(b),R1(a),W1(a), 
W1(b)) 

(PnA.  CnA)  has  a  cycle 

(W1(b),  W1(a),  R2(a),  R2(b),  W1(b)). 

So W1(a) and W1(b) are tight
but R2(a) and R2(b) are not.

So, lock a and b for W1(a) and W1(b).
Delay R1(a) for R1(b). Delay R2(b) for
R2(a).

Figure  7 

Mixed cycle in (P ∪ A, C - A)

(W3(b),  R3(c),  W2(c),  W2(a),  R1(a)  R1(b),  W3(b)). 

Delays  shown  on  Figure  9b. 

Note  that  this  reverses  their  order. 

Delays to handle mixed cycle of Figure 9a.

Notice  that  in  segment  2,  delay  write  a 
until  write  b  occurs. 